Data Science and Business Analytics

Practice Project I

Prediction of Diabetes Patient Based on Diagnostic Measurements

By

Hayford Osumanu

December 2022

image.png

Artificial Intelligence and Machine Learning Are the Oxygen of Today's Business

Executive Summary

The primary objective of this study is to build a machine learning classification model that helps predict whether a patient has diabetes, based on certain diagnostic measurements. To achieve this objective, the analyst used univariate analysis, multivariate analysis, and decision tree ensemble models (bagging and boosting).

Univariate analysis was used to explore each variable and to describe the distributions of the relevant variables in the dataset. Multivariate analysis was used to explore relationships between the important variables. Finally, decision tree models (bagging and boosting) were used to model the relationship between the independent (explanatory) variables and the dependent (target) variable, so that the target can be predicted from the explanatory variables. The primary software used for the analysis was Python.

The analysis showed that the most important factors for predicting diabetes include glucose level, BMI, age, and the Diabetes Pedigree Function (DPF). The decision tree model predicted that a person is less likely to have diabetes if the person is younger than 29 years and has a glucose level below 127, a BMI below 32.3, and a DPF below 0.67; a person who exceeds these thresholds is more likely to have diabetes. Other observations are reported in the analysis.
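The thresholds quoted above can be written as a simple rule. This is only a sketch: the function name `is_low_risk` is hypothetical, and treating the four conditions as one combined test is an assumption, since the fitted tree may apply them along different branches.

```python
def is_low_risk(age, glucose, bmi, dpf):
    """Return True when all four thresholds quoted in the summary are met.

    Assumption: the four conditions are combined with a logical AND.
    """
    return age < 29 and glucose < 127 and bmi < 32.3 and dpf < 0.67


# Two made-up patients, used only to illustrate the rule.
example_low = is_low_risk(age=25, glucose=100, bmi=28.0, dpf=0.30)
example_high = is_low_risk(age=45, glucose=150, bmi=35.0, dpf=0.90)
print(example_low, example_high)
```

A rule like this is a reading aid for the summary, not a replacement for the fitted model.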

image.png

Problem Statement, Business Objectives and Data Description

Context

Diabetes is one of the most frequent diseases worldwide, and the number of diabetic patients is growing every year. The main cause of diabetes remains unknown, yet scientists believe that both genetic factors and environmental and lifestyle factors play a major role.

Individuals with diabetes face a risk of developing secondary health issues such as heart disease and nerve damage. Early detection and treatment can therefore prevent complications and reduce the risk of severe health problems. Although diabetes is incurable, it can be managed with treatment and medication.

Researchers at the Bio-Solutions lab want to gain a better understanding of this disease among women and plan to use machine learning models to identify patients who are at risk of diabetes.

About the Dataset

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Data Description:

Pregnancies: Number of times pregnant

Glucose: Plasma glucose concentration over 2 hours in an oral glucose tolerance test

BloodPressure: Diastolic blood pressure (mm Hg)

SkinThickness: Triceps skinfold thickness (mm)

Insulin: 2-Hour serum insulin (mu U/ml)

BMI: Body mass index (weight in kg/(height in m)^2)

Pedigree: Diabetes pedigree function - A function that scores likelihood of diabetes based on family history.

Age: Age in years

Outcome: Outcome variable (0: the person is not diabetic or 1: the person is diabetic)

Acknowledgements/Source of Data

Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261-265). IEEE Computer Society Press.

image.png

Importing all the Relevant Libraries for Data Analysis

image.png

Data Analysis Philosophy

image.png

Reading the Data into a DataFrame

image.png

Part I: Data Overview

image.png

Observation of the DataFrame

The initial steps to get an overview of any dataset are to:

image.png

image.png

Columns/Variable/Features of the Dataset

The data in the tables above contain diagnostic measurements recorded for each patient. The detailed data dictionary is given below.

Pregnancies: Number of times pregnant

Glucose: Plasma glucose concentration over 2 hours in an oral glucose tolerance test

BloodPressure: Diastolic blood pressure (mm Hg)

SkinThickness: Triceps skinfold thickness (mm)

Insulin: 2-Hour serum insulin (mu U/ml)

BMI: Body mass index (weight in kg/(height in m)^2)

Pedigree: Diabetes pedigree function - A function that scores likelihood of diabetes based on family history.

Age: Age in years

Outcome: Outcome variable (0: the person is not diabetic or 1: the person is diabetic)

image.png

Dimension of the Dataset

image.png

Data Types of the Dataset

Observation

image.png

Checking Values Equal to Zero or Less

Observation

image.png

Checking the Missing Values of the Dataset

Observation

image.png

Checking the Duplicates in the Dataset

image.png

Removing Duplicates from the Dataset

There are no duplicates in the data
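The overview checks named in the headings above (dimension, data types, zero values, missing values, duplicates) can be sketched with pandas as follows. The small DataFrame is made up for illustration and deliberately contains one repeated row; it is not the project's data, and the variable names are assumptions.

```python
import pandas as pd

# Tiny made-up sample using column names from the data dictionary.
df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 148],
    "BMI": [33.6, 26.6, 23.3, 0.0, 33.6],
    "Age": [50, 31, 32, 21, 50],
    "Outcome": [1, 0, 1, 0, 1],
})

n_rows, n_cols = df.shape              # dimension of the dataset
dtypes = df.dtypes                     # data type of each column
zero_or_less = (df <= 0).sum()         # values equal to zero or less, per column
missing = df.isnull().sum()            # missing (NaN) values, per column
n_duplicates = df.duplicated().sum()   # fully repeated rows

# Dropping repeated rows keeps the first occurrence of each.
df_unique = df.drop_duplicates()
print(n_rows, n_cols, n_duplicates, len(df_unique))
```

Note that a zero in `Outcome` is a valid class label, so the zero count is only meaningful for the measurement columns.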

image.png

Data Sanity Checks: Deep Checking/Scrutiny of the Dataset before EDA

Data Types of the Dataset

Overview of Categorical/Object Variables

Observation

image.png

Part II: Exploratory Data Analysis (EDA)

image.png

Numerical Analysis of the Dataset

Statistical summary of the numerical columns in both train and test dataset

Statistical Summary of the Dataset

Observations -

image.png

Graphical Univariate Analysis

Detailed Univariate Analysis

Univariate analysis

image.png

Analysis of Pregnancies Variable

image.png

Analysis of Glucose Variable

image.png

Analysis of Blood Pressure Variable

image.png

Analysis of Skin Thickness Variable

image.png

Analysis of Insulin Variable

image.png

Analysis of BMI Variable

Observation:

image.png

Analysis of Diabetes Pedigree Function Variable

Analysis of Age Variable

Observations of Diabetes Outcome

Observations of Pregnancies

Part III: EDA - Multivariate Data Analysis

Boxplot Comparison Analysis

image.png

Comparison of the Numerical Columns

image.png

Pregnancies in Relation to Diabetes Outcome

image.png

Glucose Level in Relation to Diabetes Outcome

image.png

Blood Pressure in Relation to Diabetes Outcome

image.png

Skin Thickness in Relation to Diabetes Outcome

image.png

Insulin in Relation to Diabetes Outcome

image.png

Body Mass Index (BMI) in Relation to Diabetes Outcome

image.png

Diabetes Pedigree Function in Relation to Diabetes Outcome

image.png

Age in Relation to Diabetes Outcome

General Observation

image.png

Correlation and Pairplot Analysis

Observations-

image.png

Part VI: Data Preparation for Model Building

image.png

Notes

Let's check the count of each unique category in each of the categorical variables.

Outlier Detection and Treatment

Observations

image.png

Data Preparation for modeling

Missing value treatment

Method 2: Missing values treatment

All the zero values will be replaced by the median of the respective variable.
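A minimal sketch of this replacement with pandas is shown here. Two points are assumptions: the list of columns in which a zero is treated as a missing reading, and the choice to compute the median over the whole column (zeros included), which is the literal reading of the sentence above. The sample values are made up.

```python
import pandas as pd

# Made-up sample; row 2 contains zeros standing in for missing readings.
df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137],
    "BMI": [33.6, 26.6, 0.0, 28.1, 43.1],
    "Pregnancies": [6, 1, 0, 1, 0],   # zero is a valid value here
})

# Assumed list of columns where zero is not physiologically plausible.
cols_with_invalid_zero = ["Glucose", "BMI"]

for col in cols_with_invalid_zero:
    median_value = df[col].median()
    # mask() replaces the entries where the condition is True.
    df[col] = df[col].mask(df[col] == 0, median_value)

print(df)
```

An alternative is to compute the median from the non-zero values only, which avoids pulling the replacement value toward zero when many readings are missing.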

image.png

Split the data into train and test sets
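One way to perform the split with scikit-learn is sketched below. The 70/30 ratio, the `random_state`, and the use of `stratify` (to keep the class proportions equal in both sets) are assumptions, and the arrays are synthetic stand-ins for the features and the `Outcome` column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 35% positive class.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = np.array([0] * 65 + [1] * 35)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.30,   # assumed split ratio
    random_state=1,   # fixed seed for reproducibility
    stratify=y,       # preserve the class balance in both sets
)
print(X_train.shape, X_test.shape, y_train.mean(), y_test.mean())
```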

image.png

Model evaluation criterion

The model can make wrong predictions as:

  1. Predicting a person doesn't have diabetes and the person has diabetes.
  2. Predicting a person has diabetes, and the person doesn't have diabetes.
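The two error types in the list above are usually called false negatives (case 1) and false positives (case 2). A small sketch that counts them, using made-up labels and a hypothetical helper name `count_errors`:

```python
def count_errors(y_true, y_pred):
    """Count false negatives and false positives for 0/1 labels."""
    # Case 1: predicted non-diabetic (0) but the person is diabetic (1).
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Case 2: predicted diabetic (1) but the person is non-diabetic (0).
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return fn, fp


y_true = [1, 1, 0, 0, 1, 0]
y_pred = [0, 1, 1, 0, 0, 0]
false_negatives, false_positives = count_errors(y_true, y_pred)
print(false_negatives, false_positives)
```

Recall is the share of actual positives that the model finds, so every false negative lowers it; this is why recall is the natural metric when missing a diabetic patient is the costlier mistake.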

Which case is more important?

Which metric to optimize?

Let's define one function that reports recall scores on the train and test sets and another that shows the confusion matrix, so that the same code does not have to be repeated for each model.

image.png

First, let's create functions to calculate the different metrics and the confusion matrix so that we don't have to repeat the same code for each model.
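A helper of this kind might look like the following sketch, built on scikit-learn's metric functions. The name `get_metrics_score` and the choice of metrics are assumptions; the labels are made up.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)


def get_metrics_score(y_true, y_pred):
    """Return the main classification metrics as a dictionary."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }


y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]

scores = get_metrics_score(y_true, y_pred)
# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
print(scores)
print(cm)
```

Calling the helper once with the training labels and once with the test labels gives the train-versus-test comparison used throughout the model sections.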

image.png

image.png

Part VII: Building Models for Decision Trees, Bagging, and Boosting

image.png

Building the model

Decision Tree Model

Checking model performance on training set

Observation

The decision tree model is heavily overfitting the training set.
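Overfitting of this kind can be checked by comparing the same metric on the training and test sets. The sketch below uses synthetic data from `make_classification` as a stand-in for the diabetes data, so the numbers it prints are not the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy, imbalanced data (assumption: stand-in for the real dataset).
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           weights=[0.65, 0.35], flip_y=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# A tree with no depth limit keeps splitting until every leaf is pure.
tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

train_recall = recall_score(y_train, tree.predict(X_train))
test_recall = recall_score(y_test, tree.predict(X_test))
print(f"train recall={train_recall:.2f}, test recall={test_recall:.2f}")
```

A perfect training score next to a clearly lower test score is the usual sign that the tree has memorized the training data.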

image.png

Method 2: Weighted Decision Tree Model

image.png

Hyperparameter Tuning - Decision Tree

image.png

Hyperparameter Tuning Weighted Decision Tree

Important Features for Predicting Diabetes Outcome

image.png

Bagging Classifier

Some of the important hyperparameters available for bagging classifier are:
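As an illustration, the sketch below sets several commonly tuned `BaggingClassifier` hyperparameters in scikit-learn. The values are arbitrary assumptions and the data is synthetic. The base estimator is passed positionally because its keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

bagging = BaggingClassifier(
    DecisionTreeClassifier(random_state=1),  # base estimator to be bagged
    n_estimators=50,     # number of base estimators in the ensemble
    max_samples=0.8,     # fraction of rows drawn for each estimator
    max_features=0.8,    # fraction of columns drawn for each estimator
    bootstrap=True,      # draw rows with replacement
    oob_score=True,      # score on rows left out of each bootstrap sample
    random_state=1,
)
bagging.fit(X, y)
print(len(bagging.estimators_), round(bagging.oob_score_, 3))
```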

Checking model performance on training set

Checking model performance on test set

image.png

Method 2: Bagging Classifier - Weighted Decision Tree

Bagging Classifier with weighted decision tree

Hyperparameter Tuning - Bagging Classifier

Bagging Classifier

Some of the important hyperparameters available for bagging classifier are:

Checking model performance on training set

Checking model performance on test set

The model performance has improved, but the model is still overfitting the training data.

Method 2: Logistic Regression as the base estimator for Bagging Classifier

Let's try using logistic regression as the base estimator for bagging classifier:
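A minimal sketch of this idea, assuming scikit-learn and synthetic data; the number of estimators and `max_iter` are arbitrary choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Logistic regression replaces the default decision tree as the base learner.
bagging_lr = BaggingClassifier(
    LogisticRegression(max_iter=1000),
    n_estimators=20,
    random_state=1,
)
bagging_lr.fit(X, y)
predictions = bagging_lr.predict(X)
print(predictions[:10])
```

Because logistic regression is a low-variance learner, bagging it tends to change the results less than bagging deep trees does.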

Insights

Tuning Bagging Classifier - Weighted Model

image.png

Random Forest

Random Forest Classifier

Now, let's see if we can get a better model by tuning the random forest classifier. Some of the important hyperparameters available for random forest classifier are:
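A tuning run of this kind can be sketched with `GridSearchCV`, scoring on recall. The parameter grid, the fold count, and the synthetic data are assumptions chosen to keep the example fast; they are not the grid used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.65, 0.35], random_state=1)

# Small illustrative grid over commonly tuned random forest hyperparameters.
param_grid = {
    "n_estimators": [25, 50],     # number of trees
    "max_depth": [3, 5],          # limit tree depth to curb overfitting
    "min_samples_leaf": [1, 5],   # minimum rows required in a leaf
}

search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    scoring="recall",   # optimize recall on the positive class
    cv=3,
)
search.fit(X, y)
best_rf = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```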

Checking model performance on training set

Checking model performance on test set

Random Forest - Weighted Class

Random forest with class weights

Hyperparameter Tuning - Random Forest

Let's try using class_weights for random forest:
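In scikit-learn this is done through the `class_weight` parameter. The sketch below gives the positive class a larger weight; the specific weights and the synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.65, 0.35], random_state=1)

# Assumed weights: errors on the minority (diabetic) class cost more.
rf_weighted = RandomForestClassifier(
    n_estimators=100,
    class_weight={0: 0.35, 1: 0.65},
    random_state=1,
)
rf_weighted.fit(X, y)
print(rf_weighted.predict(X[:5]))
```

Passing `class_weight="balanced"` instead lets scikit-learn derive the weights from the class frequencies in the training labels.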

Checking model performance on training set

Checking model performance on test set

Method 2

Class_Weights for Random Forest - Hyperparameter Tuning

Let's try using class_weights for random forest:

Important Features for Predicting Diabetes - Random Forest Model
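Feature importances from a fitted random forest can be listed as in the sketch below. The model here is trained on synthetic data with the dataset's column names attached only for readability, so the ranking it prints says nothing about the real features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                 "Insulin", "BMI", "Pedigree", "Age"]

# Synthetic stand-in data with one column per feature name.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=1)
rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# feature_importances_ holds one normalized score per input column.
importances = (
    pd.Series(rf.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importances)
```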

image.png

Hyperparameter tuning of the random forest reduces the overfitting.

image.png

Summary Performance Measures of Bagging Models Train vs Test Models


Summary Performance Measures of Bagging Models Tuning Train vs Tuning Test Models


image.png

Boosting Decision Tree Models
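As a rough sketch of how two of the boosting models in this part can be fitted and compared on test recall with scikit-learn: default hyperparameters and synthetic data are assumed, so the printed scores are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.65, 0.35], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

models = {
    "AdaBoost": AdaBoostClassifier(random_state=1),
    "GradientBoosting": GradientBoostingClassifier(random_state=1),
}

# Fit each model and record its recall on the held-out test set.
test_recall = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_recall[name] = recall_score(y_test, model.predict(X_test))

print(test_recall)
```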

AdaBoost Classifier

Checking model performance on training set

Checking model performance on test set

image.png

Hyperparameter Tuning - AdaBoost Classifier

Checking model performance on training set

Checking model performance on test set

Important Features for Predicting Diabetes Outcome - AdaBoost Classifier

image.png

Gradient Boosting Classifier

Checking model performance on training set

Checking model performance on test set

image.png

Hyperparameter Tuning - Gradient Boosting Classifier

Checking model performance on training set

Checking model performance on test set

The performance of the Gradient Boosting classifier remains the same after hyperparameter tuning.

Important Features for Predicting Diabetes Outcome - Gradient Boost Classifier

image.png

XGBoost Classifier

XGBoost has many hyperparameters that can be tuned to improve model performance. Some of the important ones are:

Checking model performance on training set

Checking model performance on test set

image.png

Hyperparameter Tuning - XGBoost Classifier

Checking model performance on training set

Checking model performance on test set

Important Features for Predicting Diabetes Outcome - XGBoost Classifier

image.png

Stacking Classifier
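A stacking classifier trains several base models and then fits a final model on their predictions. The sketch below assumes a random forest and a gradient boosting model as base learners with logistic regression as the final estimator; the source does not state which models were stacked, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

# Assumed base learners; each is identified by a short name.
estimators = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=1)),
    ("gb", GradientBoostingClassifier(random_state=1)),
]

stack = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,   # out-of-fold predictions are used to train the final estimator
)
stack.fit(X_train, y_train)
stack_predictions = stack.predict(X_test)
print(stack.score(X_test, y_test))
```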

Checking model performance on training set

Checking model performance on test set

image.png

Summary Performance Comparison Measures of Boosting Train vs Test

image.png

Summary Performance Comparison Measures of Boosting Tuned Training vs Tuned Testing

image.png

Summary Performance Measures of all Bagging and Boosting: All Training Models

image.png

Summary Performance Measures of all Bagging and Boosting: All Testing Models

Observations:

image.png

Important features of the final selected model

Selected Model - Tuned Random Forest

Important Features for Predicting Diabetes Outcome - Tuned AdaBoost Classifier

Second Model Selection - Gradient Boost Classifier

image.png

Using Decision Tree for Predicting Diabetes Outcome

General Models Observations


image.png

Business Recommendations

image.png

image.png

image.png